1 Statistics and Machine Learning

This session starts with basic concepts of statistics and machine learning. Traditionally, we use statistics to estimate models and test hypotheses with data. We collect data, describe variables with summaries such as averages, variances, and distributions, and then explain relationships between two or more variables. The roots of statistics lie in working with data and checking theory against data (Breiman 2001). This data modeling culture relies heavily on theory and hypotheses: data are first generated or collected to test a model, in an attempt to explain the relationship before predicting. Machine learning takes a different approach. It focuses on prediction, using algorithmic models without first developing a theory and hypotheses. UC Berkeley statistics professor Leo Breiman compares these two cultures of statistics in his famous 2001 paper and explains the difference. He argues that with so much data coming in from all sources and directions, the data modeling approach alone may not make the best use of data to solve the problem, and he suggests employing algorithmic models to improve prediction. An algorithm is a sequence of computational and/or mathematical steps for solving a problem. The goal of algorithmic modeling is to identify an algorithm that operates on predictor variables (x) to best predict the response variable (y).
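
Breiman's distinction can be made concrete with a small sketch in R. This is an illustrative example on simulated data, not from the text: the linear model plays the role of the data model, and a simple nearest-neighbors predictor (the helper `knn_predict` is invented here) stands in for an algorithmic model.

```r
# Simulated data: a nonlinear relationship between x and y
set.seed(42)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# Data model: assume a functional form (linear), estimate, and interpret
data_model <- lm(y ~ x)
summary(data_model)$coefficients

# Algorithmic model: assume no form; predict y at x0 from the k nearest
# training points (a hypothetical k-nearest-neighbors helper)
knn_predict <- function(x0, k = 10) {
  idx <- order(abs(x - x0))[1:k]  # indices of the k closest x values
  mean(y[idx])                    # average their responses
}
sapply(c(0.25, 0.50, 0.75), knn_predict)
```

Here the linear model struggles to capture the sine-shaped relationship, while the algorithmic predictor tracks it without any theory about its form.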

1.1 What is machine learning?

The ultimate goal of data modeling is to explain and predict the variable of interest using data. Machine learning pursues this goal with computer algorithms designed to make predictions more effectively and solve the problem. According to Carnegie Mellon computer science professor Tom M. Mitchell, machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience. To statisticians, the "improve through experience" part corresponds to validation or cross-validation. Learning proceeds through repeated exercises on the data: a computer or statistics program performs repeated estimations, much as a human learns from experience and improves actions and decisions. In machine learning this is called the training process.

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” -Tom M. Mitchell
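
Mitchell's definition can be mapped onto a small sketch. This is an invented illustration (the function name `mse_after_experience` and the simulated data are assumptions, not from the text): the task T is predicting y from x, the performance measure P is mean squared error on held-out data, and experience E is the number of training examples the program has seen.

```r
# T: predict y from x; P: mean squared error on a held-out test set;
# E: the number of training observations
set.seed(1)
x_test <- runif(500)
y_test <- 2 * x_test + rnorm(500, sd = 0.5)

mse_after_experience <- function(n_train) {
  x <- runif(n_train)                      # experience E: n_train examples
  y <- 2 * x + rnorm(n_train, sd = 0.5)
  fit <- lm(y ~ x)                         # learning from experience
  mean((y_test - predict(fit, data.frame(x = x_test)))^2)  # performance P
}

sapply(c(5, 50, 500), mse_after_experience)  # P tends to improve as E grows
```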




Figure: Dimensionality of data


Figure: Example of the fourth dimension of data


1.1.1 Hands-on workshop: Data programming

2 Data Programming

This session starts with basic principles of data programming, i.e., coding that involves data. Data programming is a practice that works and evolves with the data. It allows the user to manage and process data in a more effective manner. Programs should be designed to be replicable by the user and by collaborators. A data program can be developed and updated iteratively and incrementally; in other words, it builds on accumulated work without repeating earlier steps. This takes debugging, the process of identifying and fixing problems (bugs), and in practice also means updating the program for different situations and inputs as it is used in new contexts, including by the programmer himself or herself in the future.

2.1 Principles of Programming

Social scientists Gentzkow and Shapiro (2014) list several principles for data programming.

  1. Automation
  • For replicability (future-proof, for the future you)
  2. Version control
  • Allows evolution and updated editions
  • Use Git and GitHub
  3. Directories/modularity
  • Organize by functions and data chunks
  4. Keys
  • Index variables (relational)
  5. Abstraction
  • KISS (Keep it short and simple)
  6. Documentation
  • Comments for communicating with later users
  7. Management
  • Collaboration ready
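
The "Keys" principle, for example, means every table carries an index variable that relates it to other tables. A minimal illustration with made-up data (the names `respondents`, `answers`, and `respondent_id` are invented here):

```r
# respondent_id is the key relating two separate tables
respondents <- data.frame(respondent_id = c(1, 2, 3),
                          age = c(25, 34, 52))
answers <- data.frame(respondent_id = c(1, 2, 3),
                      party = c("KMT", "DPP", "NPP"))

# Joining on the key recovers the full relational record
merged <- merge(respondents, answers, by = "respondent_id")
merged
```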

2.2 Functionalities of Data Programs

A data program can provide or perform:

  1. Documentation of data
  2. Importing and exporting data
  3. Management of data
  4. Visualization of data
  5. Data models
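
A minimal sketch of several of these functionalities in base R, using a made-up data frame (the names `df`, `tmp`, and the derived variable `high` are assumptions for illustration):

```r
# A small made-up dataset
df <- data.frame(id = 1:5, score = c(3, 5, 2, 4, 5))

# Exporting and importing data
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
df2 <- read.csv(tmp)

# Management of data: create a derived variable
df2$high <- df2$score >= 4

# A simple data model
summary(lm(score ~ id, data = df2))
```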

2.3 Data programming in R

R basics

# Create variables composed of random numbers
x <- rnorm(50)
y <- rnorm(50)

# Plot the points in the plane 
plot(x, y)
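
Because rnorm() draws pseudo-random numbers, this script produces a different plot on every run. Setting the random seed makes it replicable, in line with the automation-for-replicability principle above (the seed value 2016 is arbitrary):

```r
# Fix the random seed so the script gives the same result every run
set.seed(2016)
x <- rnorm(50)
y <- rnorm(50)
plot(x, y)  # identical scatter on every run
```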

2.3.1 Using R packages

# Plot better, using the ggplot2 package 
## Prerequisite: install and load the ggplot2 package
## install.packages("ggplot2")
library(ggplot2)
qplot(x, y)  # note: qplot() is deprecated in recent ggplot2 versions

2.4 Data Visualization with R

# Plot even better with ggplot2
x <- rnorm(50)
y <- rnorm(50)
ggplot(mapping = aes(x, y)) + theme_bw() + geom_point(col = "blue")

Taiwan Election and Democratization Study 2016 data

Taiwan Election and Democratization Study (TEDS) is one of the longest-running and most comprehensive election studies, starting in 2001. TEDS collects data through different survey modes, including face-to-face interviews, telephone interviews, and internet surveys. More details about TEDS can be found at the National Chengchi University Election Study Center website at https://esc.nccu.edu.tw/main.php.

# Import the TEDS 2016 data in Stata format using the haven package
## install.packages("haven")

library(haven)
TEDS_2016 <- haven::read_stata("https://github.com/datageneration/home/blob/master/DataProgramming/data/TEDS_2016.dta?raw=true")

# Prepare to analyze the Party ID variable
# Assign labels to the values (1=KMT, 2=DPP, 3=NP, 4=PFP, 5=TSU, 6=NPP, 7="NA")

TEDS_2016$PartyID <- factor(TEDS_2016$PartyID, labels=c("KMT","DPP","NP","PFP", "TSU", "NPP","NA"))

Take a look at the variable:

# Check the variable
attach(TEDS_2016)
head(PartyID)
## [1] NA  NA  KMT NA  NA  DPP
## Levels: KMT DPP NP PFP TSU NPP NA
tail(PartyID)
## [1] NA  NA  DPP NA  NA  NA 
## Levels: KMT DPP NP PFP TSU NPP NA
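
Note that "NA" here is a factor level (the label we assigned to value 7), not R's missing value NA. The two are easy to confuse; a small made-up example (the vector `pid` is invented for illustration):

```r
# Values 1, 7, 2 labeled as in the TEDS coding: 1=KMT, 2=DPP, 7="NA"
pid <- factor(c(1, 7, 2), labels = c("KMT", "DPP", "NA"))

is.na(pid)   # FALSE FALSE FALSE -- no values are actually missing
pid == "NA"  # FALSE  TRUE FALSE -- one respondent with no party ID
```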

Frequency table:

# Run a frequency table of the Party ID variable using the descr package
## install.packages("descr")
library(descr)
freq(TEDS_2016$PartyID)

## TEDS_2016$PartyID 
##       Frequency  Percent
## KMT         388  22.9586
## DPP         591  34.9704
## NP            3   0.1775
## PFP          32   1.8935
## TSU           5   0.2959
## NPP          43   2.5444
## NA          628  37.1598
## Total      1690 100.0000
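
A similar table can be produced without the descr package, using base R's table() and prop.table(). Shown here on a made-up factor (`pid`) so the snippet runs standalone:

```r
# Base-R alternative to descr::freq()
pid <- factor(c("KMT", "DPP", "DPP", "NA", "KMT", "DPP"),
              levels = c("KMT", "DPP", "NA"))
counts <- table(pid)
percents <- prop.table(counts) * 100
cbind(Frequency = counts, Percent = round(percents, 4))
```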

Get a better chart of the Party ID variable:

# Plot the Party ID variable
library(ggplot2)
ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar()

We can attend to more details of the chart, such as adding labels to the x and y axes and showing percentages instead of counts.

ggplot2::ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties")

Adding colors, with another theme:

ggplot2::ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = (..count..)/sum(..count..),fill=PartyID)) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw()

Hold on, the colors are not right!

ggplot2::ggplot(TEDS_2016, aes(PartyID)) + 
  geom_bar(aes(y = (..count..)/sum(..count..),fill=PartyID)) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values=c("steelblue","forestgreen","khaki1","orange","goldenrod","yellow","grey"))

To make the chart more meaningful, we can use the tidyverse collection of packages to manage the data.

## install.packages("tidyverse")
library(tidyverse)
# Compute counts and percentages, then store in T2
T2 <- TEDS_2016 %>% 
  count(PartyID) %>% 
  mutate(perc = n / nrow(TEDS_2016))
ggplot2::ggplot(T2, aes(x = reorder(PartyID, -perc), y = perc, fill = PartyID)) + 
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  ylab("Party Support (%)") + 
  xlab("Taiwan Political Parties") +
  theme_bw() +
  scale_fill_manual(values=c("steelblue","forestgreen","khaki1","orange","goldenrod","yellow","grey"))

2.5 References

Breiman, Leo. 2001. "Statistical Modeling: The Two Cultures." Statistical Science 16(3): 199-231.

Gentzkow, Matthew, and Jesse M. Shapiro. 2014. Code and Data for the Social Sciences: A Practitioner's Guide. University of Chicago.

Williams, Graham. 2011. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. New York: Springer.